Here we import all the libraries and modules needed for the whole project in a single cell.
# Libraries for Basic Process
import numpy as np
import pandas as pd
import string as st
# Libraries for Visualization
import matplotlib.pyplot as plt
import matplotlib.image as mplib
import seaborn as sns
%matplotlib inline
# Pre-setting Plot Style
font={'size':15}
plt.rc('font', **font)
plt.rc('xtick',labelsize=12)
plt.rc('ytick',labelsize=12)
sns.set_style({'xtick.bottom':True,'ytick.left':True,'text.color':'#9400D3',
'axes.labelcolor': 'blue','patch.edgecolor': 'black'})
# sklearn Modules
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import metrics
# scipy Modules
from scipy.stats import zscore
from scipy.spatial.distance import cdist, pdist
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
# Supporting Module
from imblearn.over_sampling import SMOTE
# Module to Suppress Warnings
from warnings import filterwarnings
filterwarnings('ignore')
DOMAIN: Automobile
CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon.
PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models to predict ‘mpg’.
Steps and tasks:
Data analysis & visualisation:
Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
Hint: Use your best analytical approach. You can even mix and match columns to create new ones for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.
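As a sketch of the kind of feature engineering the hint suggests, two derived columns can be built from existing ones. The column names follow the schema used later in this notebook (hp, wt, disp, cyl); the derived names are our own inventions for illustration.

```python
import pandas as pd

# Hypothetical derived features: power-to-weight ratio and
# displacement per cylinder (toy data stands in for the real frame)
df = pd.DataFrame({'hp': [130, 165], 'wt': [3504, 3693],
                   'disp': [307.0, 350.0], 'cyl': [8, 8]})
df['hp_per_klb'] = df['hp'] / (df['wt'] / 1000)   # horsepower per 1000 lb
df['disp_per_cyl'] = df['disp'] / df['cyl']       # displacement per cylinder
```

Ratios like these often separate heavy muscle cars from economy cars more cleanly than either raw column alone.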
# Loading the files and creating dataframes
CarAttri = pd.read_json('Car-Attributes.json')
CarName = pd.read_csv('Car name.csv')
# Getting Shape and Size of each data
CA = CarAttri.shape
CN = CarName.shape
# Displaying Car Attributes Dataset
print('\033[1m1. Car Attributes Dataset consists of:-\033[0m\n Number of Rows =',CA[0],'\n Number of Columns =',CA[1])
display(CarAttri.head())
print('_________________________________________________________________________________')
# Displaying Car Name Dataset
print('\033[1m\n2. Car Name Dataset consists of:-\033[0m\n Number of Rows =',CN[0],'\n Number of Columns =',CN[1])
display(CarName.head())
1. Car Attributes Dataset consists of:-
Number of Rows = 398
Number of Columns = 8
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
_________________________________________________________________________________
2. Car Name Dataset consists of:-
Number of Rows = 398
Number of Columns = 1
| car_name | |
|---|---|
| 0 | chevrolet chevelle malibu |
| 1 | buick skylark 320 |
| 2 | plymouth satellite |
| 3 | amc rebel sst |
| 4 | ford torino |
Key Observations:-
# Merging two Datasets
cardata = pd.concat([CarAttri, CarName], axis=1)
# Getting Shape and Size of final dataset
CD = cardata.shape
# Displaying Final Dataset
print('\033[1mFinal Dataset consists of:-\033[0m\n Number of Rows =',CD[0],'\n Number of Columns =',CD[1])
display(cardata.head(10))
Final Dataset consists of:-
Number of Rows = 398
Number of Columns = 9
| mpg | cyl | disp | hp | wt | acc | yr | origin | car_name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| 5 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
| 6 | 14.0 | 8 | 454.0 | 220 | 4354 | 9.0 | 70 | 1 | chevrolet impala |
| 7 | 14.0 | 8 | 440.0 | 215 | 4312 | 8.5 | 70 | 1 | plymouth fury iii |
| 8 | 14.0 | 8 | 455.0 | 225 | 4425 | 10.0 | 70 | 1 | pontiac catalina |
| 9 | 15.0 | 8 | 390.0 | 190 | 3850 | 8.5 | 70 | 1 | amc ambassador dpl |
Key Observations:-
# Storing our Final Dataset in .csv format
cardata.to_csv('cardata.csv')
Key Observations:-
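A side note on the save above: passing `index=False` to `to_csv` prevents pandas from persisting the row index, which otherwise reappears as an extra "Unnamed: 0" column when the file is read back (as happens in the next cell). A minimal sketch:

```python
import pandas as pd

# Writing with index=False avoids persisting the row index as an
# extra "Unnamed: 0" column on re-read
df = pd.DataFrame({'mpg': [18.0, 15.0], 'cyl': [8, 8]})
df.to_csv('cardata_noindex.csv', index=False)
back = pd.read_csv('cardata_noindex.csv')
print(back.columns.tolist())  # → ['mpg', 'cyl']
```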
# Importing the stored dataset
cardata = pd.read_csv('cardata.csv')
# Displaying the Dataset
print('\033[1mDataset:-')
display(cardata.head(10))
Dataset:-
| Unnamed: 0 | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| 5 | 5 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
| 6 | 6 | 14.0 | 8 | 454.0 | 220 | 4354 | 9.0 | 70 | 1 | chevrolet impala |
| 7 | 7 | 14.0 | 8 | 440.0 | 215 | 4312 | 8.5 | 70 | 1 | plymouth fury iii |
| 8 | 8 | 14.0 | 8 | 455.0 | 225 | 4425 | 10.0 | 70 | 1 | pontiac catalina |
| 9 | 9 | 15.0 | 8 | 390.0 | 190 | 3850 | 8.5 | 70 | 1 | amc ambassador dpl |
Key Observations:-
# Checking for Null Values in the Attributes
print('\n\033[1mNull Values in the Features:-')
display(cardata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| Unnamed: 0 | 0 |
| mpg | 0 |
| cyl | 0 |
| disp | 0 |
| hp | 0 |
| wt | 0 |
| acc | 0 |
| yr | 0 |
| origin | 0 |
| car_name | 0 |
Key Observations:-
# Dropping "Unnamed:0" and "car_name" Attributes
cardata.drop(['Unnamed: 0','car_name'],axis=1,inplace=True)
# Displaying the Dataset
print('\033[1mDataset:-')
display(cardata.head(10))
Dataset:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
| 5 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | 70 | 1 |
| 6 | 14.0 | 8 | 454.0 | 220 | 4354 | 9.0 | 70 | 1 |
| 7 | 14.0 | 8 | 440.0 | 215 | 4312 | 8.5 | 70 | 1 |
| 8 | 14.0 | 8 | 455.0 | 225 | 4425 | 10.0 | 70 | 1 |
| 9 | 15.0 | 8 | 390.0 | 190 | 3850 | 8.5 | 70 | 1 |
Key Observations:-
# Checking every column for non-numeric (dirty) values such as '?'
# The original character-by-character scan only catches single-character
# tokens; coercing to numeric flags any non-numeric entry in any column.
print('\033[1mDirty Values in the Dataset:-')
dirty_mask = cardata.apply(lambda col: pd.to_numeric(col, errors='coerce').isnull())
dirty = cardata[dirty_mask.any(axis=1)]
display(dirty)
Dirty Values in the Dataset:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | 1 |
| 126 | 21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | 1 |
| 330 | 40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | 2 |
| 336 | 23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | 1 |
| 354 | 34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | 2 |
| 374 | 23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | 1 |
Key Observations:-
# Replacing '?' by 0
cardata['hp'].replace('?',0,inplace=True)
# Displaying Dataset After Replacing
print('\n\033[1mDataset After Replacing:-')
index = dirty.index
display(cardata.iloc[index])
Dataset After Replacing:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | 0 | 2046 | 19.0 | 71 | 1 |
| 126 | 21.0 | 6 | 200.0 | 0 | 2875 | 17.0 | 74 | 1 |
| 330 | 40.9 | 4 | 85.0 | 0 | 1835 | 17.3 | 80 | 2 |
| 336 | 23.6 | 4 | 140.0 | 0 | 2905 | 14.3 | 80 | 1 |
| 354 | 34.5 | 4 | 100.0 | 0 | 2320 | 15.8 | 81 | 2 |
| 374 | 23.0 | 4 | 151.0 | 0 | 3035 | 20.5 | 82 | 1 |
Key Observations:-
# Displaying Data types of dataset
print('\n\033[1mData Types of Each Attribute:-')
display(cardata.dtypes.to_frame('Data Type'))
Data Types of Each Attribute:-
| Data Type | |
|---|---|
| mpg | float64 |
| cyl | int64 |
| disp | float64 |
| hp | object |
| wt | int64 |
| acc | float64 |
| yr | int64 |
| origin | int64 |
Key Observations:-
#Converting 'hp' Attribute datatype to integer
cardata['hp'] = cardata['hp'].astype('int64')
# Displaying Data types of dataset
print('\n\033[1mData Types of Each Attribute:-')
display(cardata.dtypes.to_frame('Data Type'))
Data Types of Each Attribute:-
| Data Type | |
|---|---|
| mpg | float64 |
| cyl | int64 |
| disp | float64 |
| hp | int64 |
| wt | int64 |
| acc | float64 |
| yr | int64 |
| origin | int64 |
Key Observations:-
# Replacing 0 by Mean value of hp attribute
cardata['hp'].replace(0,round(cardata['hp'].mean()),inplace=True)
Key Observations:-
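The two-step treatment above (replace '?' with 0, cast, then replace 0 with the mean) can also be done in one pass. A hedged sketch of that alternative, not the notebook's exact method: coerce the non-numeric tokens to NaN and fill with the mean of the valid entries, which also avoids the temporary 0 placeholders pulling the computed mean down slightly.

```python
import pandas as pd

# Coerce '?' (and any other non-numeric token) to NaN, then impute
# with the rounded mean of the valid values in a single step
hp = pd.Series(['130', '165', '?', '150'])
hp = pd.to_numeric(hp, errors='coerce')   # '?' becomes NaN
hp = hp.fillna(round(hp.mean()))          # mean ignores NaN by default
```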
# Getting index of dirty data
index = dirty.index
# Displaying Before Correction/Treatment Process
print('\n\033[1mBefore Correction/Treatment Process:-')
display(dirty)
print('_______________________________________________________')
# Displaying After Correction/Treatment Process
print('\n\033[1mAfter Correction/Treatment Process:-')
display(cardata.iloc[index])
Before Correction/Treatment Process:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | 1 |
| 126 | 21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | 1 |
| 330 | 40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | 2 |
| 336 | 23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | 1 |
| 354 | 34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | 2 |
| 374 | 23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | 1 |
_______________________________________________________
After Correction/Treatment Process:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | 103 | 2046 | 19.0 | 71 | 1 |
| 126 | 21.0 | 6 | 200.0 | 103 | 2875 | 17.0 | 74 | 1 |
| 330 | 40.9 | 4 | 85.0 | 103 | 1835 | 17.3 | 80 | 2 |
| 336 | 23.6 | 4 | 140.0 | 103 | 2905 | 14.3 | 80 | 1 |
| 354 | 34.5 | 4 | 100.0 | 103 | 2320 | 15.8 | 81 | 2 |
| 374 | 23.0 | 4 | 151.0 | 103 | 3035 | 20.5 | 82 | 1 |
Key Observations:-
# Describing the data in terms of count, mean, standard deviation, and five-point summary
print('\n\033[1mBrief Summary of Dataset:-')
display(cardata.describe())
Brief Summary of Dataset:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| count | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 | 398.000000 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.447236 | 2970.424623 | 15.568090 | 76.010050 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.199608 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 70.000000 | 1.000000 |
| 25% | 17.500000 | 4.000000 | 104.250000 | 76.000000 | 2223.750000 | 13.825000 | 73.000000 | 1.000000 |
| 50% | 23.000000 | 4.000000 | 148.500000 | 95.000000 | 2803.500000 | 15.500000 | 76.000000 | 1.000000 |
| 75% | 29.000000 | 8.000000 | 262.000000 | 125.000000 | 3608.000000 | 17.175000 | 79.000000 | 2.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 82.000000 | 3.000000 |
# Checking skewness of the data attributes
print('\033[1mSkewness of all attributes:-')
display(cardata.skew().to_frame(name='Skewness'))
Skewness of all attributes:-
| Skewness | |
|---|---|
| mpg | 0.457066 |
| cyl | 0.526922 |
| disp | 0.719645 |
| hp | 1.097264 |
| wt | 0.531063 |
| acc | 0.278777 |
| yr | 0.011535 |
| origin | 0.923776 |
# Checking Covariance between all attributes
print('\033[1mCovariance between all attributes:-')
display(cardata.cov())
Covariance between all attributes:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| mpg | 61.089611 | -10.308911 | -655.402318 | -230.423159 | -5505.211745 | 9.058930 | 16.741163 | 3.532185 |
| cyl | -10.308911 | 2.893415 | 168.623214 | 54.536651 | 1290.695575 | -2.370842 | -2.193499 | -0.767477 |
| disp | -655.402318 | 168.623214 | 10872.199152 | 3560.844316 | 82368.423240 | -156.332976 | -142.717137 | -50.964989 |
| hp | -230.423159 | 54.536651 | 3560.844316 | 1459.210055 | 27848.819690 | -72.119698 | -58.188385 | -13.894131 |
| wt | -5505.211745 | 1290.695575 | 82368.423240 | 27848.819690 | 717140.990526 | -974.899011 | -959.946344 | -394.639330 |
| acc | 9.058930 | -2.370842 | -156.332976 | -72.119698 | -974.899011 | 7.604848 | 2.938105 | 0.455354 |
| yr | 16.741163 | -2.193499 | -142.717137 | -58.188385 | -959.946344 | 2.938105 | 13.672443 | 0.535790 |
| origin | 3.532185 | -0.767477 | -50.964989 | -13.894131 | -394.639330 | 0.455354 | 0.535790 | 0.643292 |
# Checking Variance of data attributes
print('\033[1m\nVariance of all attributes:-')
display(cardata.var().to_frame(name='Variance'))
Variance of all attributes:-
| Variance | |
|---|---|
| mpg | 61.089611 |
| cyl | 2.893415 |
| disp | 10872.199152 |
| hp | 1459.210055 |
| wt | 717140.990526 |
| acc | 7.604848 |
| yr | 13.672443 |
| origin | 0.643292 |
# Getting Interquartile Range of data attributes
print('\033[1mIQR of all attributes:-')
display((cardata.quantile(0.75) - cardata.quantile(0.25)).to_frame(name='Interquartile Range'))
IQR of all attributes:-
| Interquartile Range | |
|---|---|
| mpg | 11.50 |
| cyl | 4.00 |
| disp | 157.75 |
| hp | 49.00 |
| wt | 1384.25 |
| acc | 3.35 |
| yr | 6.00 |
| origin | 1.00 |
# Checking Correlation by plotting Heatmap for all attributes
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(12,8))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(cardata.corr(),annot=True,fmt= '.2f',cmap='magma');
plt.show()
Heatmap showing Correlation of Data attributes:-
Key Observations:-
Univariate analysis is the simplest form of data analysis: it examines only one variable at a time.
We will use the following functions to analyze each individual attribute.
# Creating Plot function for Quantitative Attributes
def qt_data(x):
# Plotting Distribution for Quantitative attribute
print(f'\033[1mPlot Showing Distribution of Feature "{x}":-')
plt.figure(figsize=(12,6))
plt.title(f'Distribution of "{x}"\n')
sns.distplot(cardata[x],color='#9400D3');
print('')
plt.show()
print('\n__________________________________________________________________________________________________\n')
print('')
# Box plot for Quantitative data
print(f'\033[1mPlot Showing 5 point summary with outliers of Attribute "{x}":-\n')
plt.figure(figsize=(12,6))
plt.title(f'Box Plot for "{x}"\n')
sns.boxplot(cardata[x],color="#9400D3");
plt.show()
# Creating Plot function for Categorical Attributes
def cat_data(x):
# Plotting Frequency Distribution of categorical attribute
colors = ['gold','tomato','yellowgreen','blue','pink','#ADD8E6']
print(f'\033[1mPlot Showing Frequency Distribution of Attribute "{x}":-')
plt.figure(figsize=(10,8))
plt.title(f'Frequencies of "{x}" Attribute\n')
sns.countplot(cardata[x],palette='bright');
plt.show()
print('\n___________________________________________________________________________________')
print('')
# Plotting Pie Chart to check contribution of categorical attribute
print(f'\033[1m\nPie Chart Showing Contribution of Each Category of "{x}" feature:-\n')
plt.title(f'Contribution of Each Category of "{x}" Attribute\n\n\n\n\n\n')
cardata[x].value_counts().plot.pie(radius=2.5,shadow=True,autopct='%1.1f%%',colors=colors);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
plt.show()
# Univariate analysis for mpg Attribute
qt_data('mpg')
Plot Showing Distribution of Feature "mpg":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "mpg":-
# Univariate analysis for cyl Attribute
cat_data('cyl')
Plot Showing Frequency Distribution of Attribute "cyl":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "cyl" feature:-
# Univariate analysis for disp Attribute
qt_data('disp')
Plot Showing Distribution of Feature "disp":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "disp":-
# Univariate analysis for hp Attribute
qt_data('hp')
Plot Showing Distribution of Feature "hp":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "hp":-
# Univariate analysis for wt Attribute
qt_data('wt')
Plot Showing Distribution of Feature "wt":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "wt":-
# Univariate analysis for acc Attribute
qt_data('acc')
Plot Showing Distribution of Feature "acc":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "acc":-
# Univariate analysis for yr Attribute
cat_data('yr')
Plot Showing Frequency Distribution of Attribute "yr":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "yr" feature:-
# Univariate analysis for origin Attribute
cat_data('origin')
Plot Showing Frequency Distribution of Attribute "origin":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "origin" feature:-
Bivariate analysis is performed to find the relationship between a quantitative variable and a categorical variable of the dataset.
For this analysis we use violin plots, because a violin plot depicts the distribution of numeric data for one or more groups using density curves; the width of each curve corresponds to the approximate frequency of data points in that region.
# Creating Plot function for Categorical VS All Quantitative Attribute
def bi_Anly(x):
# Bivariate Analysis for Categorical VS All Quantitative Attributes
print(f'\033[1m\nPlots Showing Bivariate Analysis of "{x}" VS All Quantitative Attributes:-\n')
# Setting up Sub-Plots
fig, axes = plt.subplots(3, 2, figsize=(13, 16))
fig.suptitle(f'"{x}" VS All Quantitative Attributes')
plt.subplots_adjust(left=0.1,bottom=0.1, right=0.9, top=0.94, wspace=0.3, hspace=0.4)
# Plotting Sub-Plots
sns.violinplot(ax=axes[0, 0], x=x, y='mpg', data=cardata, palette='bright');
sns.violinplot(ax=axes[0, 1], x=x, y='disp', data=cardata, palette='bright');
sns.violinplot(ax=axes[1, 0], x=x, y='hp', data=cardata, palette='bright');
sns.violinplot(ax=axes[1, 1], x=x, y='wt', data=cardata, palette='bright');
sns.violinplot(ax=axes[2, 0], x=x, y='acc', data=cardata, palette='bright');
plt.show()
Bivariate Analysis 1: cyl VS All Quantitative Attributes
# cyl VS All Quantitative Attributes
bi_Anly('cyl')
Plots Showing Bivariate Analysis of "cyl" VS All Quantitative Attributes:-
Bivariate Analysis 2: yr VS All Quantitative Attributes
# yr VS All Quantitative Attributes
bi_Anly('yr')
Plots Showing Bivariate Analysis of "yr" VS All Quantitative Attributes:-
Bivariate Analysis 3: origin VS All Quantitative Attributes
# origin VS All Quantitative Attributes
bi_Anly('origin')
Plots Showing Bivariate Analysis of "origin" VS All Quantitative Attributes:-
# Multivariate Analysis of Attributes
print('\033[1mPlot Showing Multivariate Analysis to check Relation between Attributes:-')
# Plotting pairplot for Attributes
sns.pairplot(cardata,plot_kws={'color':'#9400D3'},diag_kws={'color':'Gold'}).fig.suptitle('Relation Between Attributes',y=1.04);
plt.show()
Plot Showing Multivariate Analysis to check Relation between Attributes:-
# Plotting Heatmap for checking Correlation
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(12,8))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(cardata.corr(),annot=True,fmt= '.2f',cmap='Spectral');
plt.show()
Heatmap showing Correlation of Data attributes:-
NOTE:- Outliers are replaced by the mean of the attribute computed without the outliers: we first calculate the mean excluding outliers and then substitute that mean for the outlier values.
# Getting Outliers and Imputing Outliers by Mean
# Creating a list of feature columns, plus the empty lists required below
clm = cardata.columns[0:-1]
AT = []
OL1 = []
OL2 = []
M1 = []
M2 = []
for i in clm:
AT.append(i)
# Getting Interquartile Range
q1 = cardata[i].quantile(0.25)
q3 = cardata[i].quantile(0.75)
IQR = q3 - q1
# Getting Mean of each Attribute having Outliers (i.e including outliers)
M1.append(round(cardata[i].mean(),2))
# Separating Outlier and Normal Values
OL = []
NOL = []
for k in cardata[i]:
if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
OL.append(k)
else:
NOL.append(k)
OL1.append(len(OL))
# Replacing Outliers by Mean of Normal Values
cardata[i].replace(OL,np.mean(NOL),inplace=True) # Here we are imputing outliers by Mean of attribute without outlier
M2.append(round(np.mean(NOL),2))
# Getting Outliers After Imputation
OL_cnt = 0
for k in cardata[i]:
if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
OL_cnt += 1
OL2.append(OL_cnt)
# Creating dataframe for better representation of Outlier Analysis
Outlier_Analysis = pd.DataFrame({'Attribute':AT,
'Mean Including Outliers':M1,
'Outliers Before Imputation':OL1,
'Mean Excluding Outliers':M2,
'Outliers After Imputation':OL2})
print('\033[1mTotal Outliers Observed in Discrete Attributes =',sum(OL1))
print('\n\033[1mTable Showing Outlier Analysis:-')
display(Outlier_Analysis)
Total Outliers Observed in Discrete Attributes = 21
Table Showing Outlier Analysis:-
| Attribute | Mean Including Outliers | Outliers Before Imputation | Mean Excluding Outliers | Outliers After Imputation | |
|---|---|---|---|---|---|
| 0 | mpg | 23.51 | 1 | 23.46 | 0 |
| 1 | cyl | 5.45 | 0 | 5.45 | 0 |
| 2 | disp | 193.43 | 0 | 193.43 | 0 |
| 3 | hp | 104.45 | 11 | 101.25 | 0 |
| 4 | wt | 2970.42 | 0 | 2970.42 | 0 |
| 5 | acc | 15.57 | 9 | 15.50 | 0 |
| 6 | yr | 76.01 | 0 | 76.01 | 0 |
Key Observations:-
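The imputation loop above can also be written in a compact, vectorized form. A sketch (our own, not the notebook's exact code) on a toy series: mask values outside the IQR fences and fill them with the mean of the remaining values.

```python
import pandas as pd

# Vectorized IQR outlier treatment: mask fence-violating values,
# then fill with the mean of the non-outlier values
s = pd.Series([10.0, 12.0, 11.0, 13.0, 100.0])    # 100.0 is an outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
s = s.mask(mask, s[~mask].mean())
```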
Our data needs scaling, so we will scale all discrete features with Z-score normalization. After standardizing the values of the dataset, we get the following statistics of the data distribution:
# Applying Z-Scores
Scaled_Data = cardata.apply(zscore)
# Checking the Mean and Standard Deviation of the scaled data
print('\033[1mTable Showing Mean and Standard Deviation of Scaled Attributes:-')
display(Scaled_Data[clm].describe()[1:3].T)
Table Showing Mean and Standard Deviation of Scaled Attributes:-
| mean | std | |
|---|---|---|
| mpg | 0.000000 | 1.001259 |
| cyl | 0.000000 | 1.001259 |
| disp | 0.000000 | 1.001259 |
| hp | 0.000000 | 1.001259 |
| wt | 0.000000 | 1.001259 |
| acc | 0.000000 | 1.001259 |
| yr | 0.000000 | 1.001259 |
Key Observations:-
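A quick sanity check on z-scoring, on toy data: every column should come out with mean ~0 and population standard deviation ~1 (`zscore` uses ddof=0, so the sample std reported by `describe()` is sqrt(n/(n-1)), slightly above 1).

```python
import pandas as pd
from scipy.stats import zscore

# After z-scoring, each column has mean ~0 and population std ~1
toy = pd.DataFrame({'wt': [1613.0, 2970.0, 5140.0],
                    'hp': [46.0, 104.0, 230.0]})
scaled = toy.apply(zscore)
```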
# Finding optimal no. of clusters K
MD=[]
for k in range(1,10):
model = KMeans(n_clusters=k)
model.fit(Scaled_Data)
MD.append(sum(np.min(cdist(Scaled_Data, model.cluster_centers_, 'euclidean'), axis=1)) / Scaled_Data.shape[0])
# OR --> MD.append(model.inertia_) Also Possible
# Displaying plot for Selecting k with the Elbow Method
print('\033[1mPlot for Selecting k with the Elbow Method:-')
plt.plot(range(1,10), MD, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('\nSelecting k with the Elbow Method\n')
plt.show()
Plot for Selecting k with the Elbow Method:-
Key Observations:-
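The elbow method can be ambiguous when the bend is shallow; a complementary check (our addition, not part of the original notebook) is the silhouette score, which peaks at the best-separated k. A sketch on toy data with two tight blobs, where the score should be highest at k = 2:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated blobs; the silhouette score should peak at k = 2
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```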
# Fitting the Model with K = 2
Model1 = KMeans(2)
Model1.fit(Scaled_Data)
pred = Model1.predict(Scaled_Data)
# Appending the prediction to our dataset
data1 = cardata.copy()
data2 = Scaled_Data.copy()
data1['GROUP'] = pred
data2['GROUP'] = pred
# Displaying Dataset
print('\033[1mDataset:-')
display(data1.head())
Dataset:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | GROUP | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | 0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | 0 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | 0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | 0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | 0 |
Key Observations:-
# Analyze the distribution of the data among the Groups
print('\033[1mTable for Analyzing the distribution of the data among the Groups:-')
data = data1.groupby(['GROUP'])
display(data.mean())
# Plotting Boxplot for Visualization
print('\033[1m\nPlot for Visualization:-')
data2.boxplot(by='GROUP', layout = (3,3), figsize=(15,8))
plt.show()
Table for Analyzing the distribution of the data among the Groups:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| GROUP | ||||||||
| 0 | 16.440237 | 7.218935 | 299.485207 | 128.601777 | 3785.698225 | 14.441457 | 74.366864 | 1.017751 |
| 1 | 28.634308 | 4.152838 | 115.155022 | 81.056769 | 2368.759825 | 16.283876 | 77.222707 | 1.982533 |
Plot for Visualization:-
Key Observations:-
# Fitting the Model (note: scikit-learn >= 1.2 renames 'affinity' to 'metric')
Model2 = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average')
Model2.fit(Scaled_Data)
# Appending Labels
cardata['LABELS'] = Model2.labels_
Scaled_Data['LABELS'] = Model2.labels_
# Analyze the distribution of the data among the Labels
print('\033[1mTable for Analyzing the distribution of the data among the Labels:-')
data = cardata.groupby(['LABELS'])
display(data.mean())
Table for Analyzing the distribution of the data among the Labels:-
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| LABELS | ||||||||
| 0 | 16.880220 | 7.131868 | 290.846154 | 126.690661 | 3726.769231 | 14.515968 | 74.659341 | 1.082418 |
| 1 | 28.997483 | 4.041667 | 111.340278 | 79.805556 | 2333.134259 | 16.331980 | 77.148148 | 1.986111 |
Key Observations:-
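Since both K-Means and agglomerative clustering produced two groups with very similar per-group means, their agreement can also be quantified. A sketch (our addition) using the adjusted Rand index, where ARI = 1 means identical partitions; the toy data here stands in for the scaled dataset.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Compare the two clusterings' label assignments on the same data
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ag = AgglomerativeClustering(n_clusters=2).fit_predict(X)
ari = adjusted_rand_score(km, ag)   # label-permutation-invariant agreement
```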
# Linkage
Link = linkage(Scaled_Data, metric='euclidean', method='average')
c, coph_dists = cophenet(Link , pdist(Scaled_Data))
# Plotting Dendrogram
print('\033[1mDendrogram:-')
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram\n')
plt.xlabel('Sample index')
plt.ylabel('Distance')
dendrogram(Link, leaf_rotation=90.0, p=12, leaf_font_size=10, truncate_mode='lastp')
plt.tight_layout()
plt.show()
print('___________________________________________________________________________________________')
# Scatter Plot for hp vs mpg
print('\033[1m\nScatter Plot for hp vs mpg:-')
plt.figure(figsize=(10, 7))
plt.title('\nScatter Plot for hp vs mpg\n')
plt.xlabel('hp')
plt.ylabel('mpg')
plt.scatter(x = cardata['hp'], y = cardata['mpg'], c=Model2.labels_)
plt.show()
Dendrogram:-
___________________________________________________________________________________________
Scatter Plot for hp vs mpg:-
Key Observations:-
# Separating the two Clusters
clust1 = Scaled_Data[Scaled_Data['LABELS']==0]
clust2 = Scaled_Data[Scaled_Data['LABELS']==1]
# Segregating Predictors VS Target Attributes
X1 = clust1.drop(columns=['mpg','LABELS'])
X2 = clust2.drop(columns=['mpg','LABELS'])
y1 = clust1['mpg']
y2 = clust2['mpg']
Key Observations:-
# Fitting Linear Regression Model
M1 = LinearRegression().fit(X1, y1)
M2 = LinearRegression().fit(X2, y2)
Coef1 = M1.score(X1, y1)
Coef2 = M2.score(X2, y2)
# Displaying the R² score (coefficient of determination) of each model
print('\033[1m\nR² Scores (Coefficient of Determination) of the Models:-')
print('\033[1m\n For group 0 =',round(Coef1,3))
print('\033[1m\n For group 1 =',round(Coef2,3))
R² Scores (Coefficient of Determination) of the Models:-
 For group 0 = 0.705
 For group 1 = 0.68
Key Observations:-
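The scores above are computed on the same data the models were fit on. A hedged extension (note that `train_test_split` is imported at the top of the notebook but never used): score each cluster's model on held-out data instead, so the reported R² is not optimistically biased. Synthetic data stands in for a cluster here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Fit on a training split, report R² on the held-out split
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)   # out-of-sample coefficient of determination
```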
Closing Sentence:- Clustering of the data is complete. Each cluster is treated as an individual dataset, and a regression model is trained on each to predict ‘mpg’.
DOMAIN: Manufacturing
CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.
DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality. Attribute Information:
PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.
Steps and tasks: Design a synthetic data generation model that can impute values for the ‘Quality’ attribute wherever the company has missed recording the data.
# Loading Data
data = pd.read_excel('Company.xlsx')
# Getting Shape and Size
shape = data.shape
# Displaying Dataset
print('\033[1mDataset consists of:-\033[0m\n Number of Rows =',shape[0],'\n Number of Columns =',shape[1])
print('\033[1m\nDataset:-')
display(data.head(10))
Dataset consists of:-
 Number of Rows = 61
 Number of Columns = 5
Dataset:-
| A | B | C | D | Quality | |
|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality A |
| 1 | 174 | 133 | 134 | 166 | Quality B |
| 2 | 159 | 163 | 135 | 131 | NaN |
| 3 | 61 | 23 | 3 | 44 | Quality A |
| 4 | 59 | 60 | 9 | 68 | Quality A |
| 5 | 153 | 140 | 154 | 199 | NaN |
| 6 | 34 | 28 | 78 | 22 | Quality A |
| 7 | 191 | 144 | 143 | 154 | NaN |
| 8 | 160 | 181 | 194 | 178 | Quality B |
| 9 | 145 | 178 | 158 | 141 | NaN |
Key Observations:-
# Getting the Number of Empty Values in the Quality Attribute
Filled = data['Quality'].value_counts().sum()
Empty = shape[0] - Filled
# Displaying Empty values and Unique values in quality attribute
print('\033[1mDataset Quality Attribute consists of:-\033[0m\n Filled =',Filled,'\n Empty =',Empty)
print(' ___________\n Total =',shape[0])
print('\033[1m\nUnique Values in Dataset:-')
display(data['Quality'].value_counts().to_frame())
Dataset Quality Attribute consists of:-
 Filled = 43
 Empty = 18
 ___________
 Total = 61
Unique Values in Dataset:-
| Quality | |
|---|---|
| Quality A | 26 |
| Quality B | 17 |
Key Observations:-
# Applying Z-Scores for Scaling
scaled = data.drop(['Quality'],axis=1)
scaled = scaled.apply(zscore)
# Finding optimal no. of clusters K
MD=[]
for k in range(1,10):
model = KMeans(n_clusters=k)
model.fit(scaled)
MD.append(sum(np.min(cdist(scaled, model.cluster_centers_, 'euclidean'), axis=1)) / scaled.shape[0])
# Displaying plot for Selecting k with the Elbow Method
print('\033[1mPlot for Selecting k with the Elbow Method:-')
plt.plot(range(1,10), MD, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('\nSelecting k with the Elbow Method\n')
plt.show()
Plot for Selecting k with the Elbow Method:-
Key Observations:-
# Fitting the Model with K = 2
model = KMeans(2).fit(scaled)
pred = model.predict(scaled)
# Appending the prediction to our dataset
data['GROUP'] = pred
Key Observations:-
# Analyze the distribution of the data among the Groups
print('\033[1mCompare the clusters with the Existing Target :-')
df = data.groupby(['GROUP'])
display(df.mean())
Compare the clusters with the Existing Target :-
| A | B | C | D | |
|---|---|---|---|---|
| GROUP | ||||
| 0 | 58.75000 | 60.928571 | 49.750000 | 53.000000 |
| 1 | 169.30303 | 163.909091 | 168.666667 | 166.606061 |
Key Observations:-
# Mapping cluster labels to the matching Qualities: cluster 0 has the low
# attribute means, matching the rows recorded as Quality A, and cluster 1
# the high means, matching Quality B
data['GROUP'].replace({0:'Quality A',1:'Quality B'},inplace=True)
data['Quality'] = data['GROUP']
data.drop(['GROUP'],axis=1,inplace=True)
display(data.head())
| A | B | C | D | Quality | |
|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality A |
| 1 | 174 | 133 | 134 | 166 | Quality B |
| 2 | 159 | 163 | 135 | 131 | Quality B |
| 3 | 61 | 23 | 3 | 44 | Quality A |
| 4 | 59 | 60 | 9 | 68 | Quality A |
Key Observations:-
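Note that assigning the cluster label to every row relabels the records the company did fill in. An alternative sketch (an assumption about the intended behaviour, not the notebook's method): fill only the missing Quality entries from the cluster assignment and keep the recorded labels untouched.

```python
import numpy as np
import pandas as pd

# Impute only the NaN Quality values from the cluster assignment;
# recorded labels are preserved as-is
df = pd.DataFrame({'Quality': ['Quality A', np.nan, 'Quality B', np.nan],
                   'GROUP': [0, 1, 1, 0]})
mapped = df['GROUP'].map({0: 'Quality A', 1: 'Quality B'})
df['Quality'] = df['Quality'].fillna(mapped)
```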
# Getting Number of Empty Attributes
Filled = data['Quality'].value_counts().sum()
Empty = shape[0] - Filled
# Displaying Empty values and Unique values in quality attribute
print('\033[1mDataset Quality Attribute consist:-\033[0m\n Filled =',Filled,'\n Empty =',Empty)
print(' ___________\n Total =',shape[0])
print('\033[1m\nUnique Values in Dataset:-')
display(data['Quality'].value_counts().to_frame())
Dataset Quality Attribute consist:-
 Filled = 61
 Empty = 0
 ___________
 Total = 61

Unique Values in Dataset:-
| Quality | |
|---|---|
| Quality A | 33 |
| Quality B | 28 |
Key Observations:-
print('\033[1mFinal Dataset:-')
display(data.head(10))
Final Dataset:-
| A | B | C | D | Quality | |
|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality B |
| 1 | 174 | 133 | 134 | 166 | Quality A |
| 2 | 159 | 163 | 135 | 131 | Quality A |
| 3 | 61 | 23 | 3 | 44 | Quality B |
| 4 | 59 | 60 | 9 | 68 | Quality B |
| 5 | 153 | 140 | 154 | 199 | Quality A |
| 6 | 34 | 28 | 78 | 22 | Quality B |
| 7 | 191 | 144 | 143 | 154 | Quality A |
| 8 | 160 | 181 | 194 | 178 | Quality A |
| 9 | 145 | 178 | 158 | 141 | Quality A |
Closing Sentence:- A synthetic data generation model is built using the existing data provided by the company.
DOMAIN: Automobile
CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.
Steps and tasks:
EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.
For example: Use your best analytical approach to build this report. Even you can mix match columns to create new columns which can be used for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.
# Loading Data
vehidata = pd.read_csv('Vehicle.csv')
# Getting Shape and Size
shape = vehidata.shape
# Displaying Dataset
print('\033[1mDataset consist:-\033[0m\n Number of Rows =',shape[0],'\n Number of Columns =',shape[1])
print('\033[1m\nDataset:-')
display(vehidata.head())
Dataset consist:-
 Number of Rows = 846
 Number of Columns = 19

Dataset:-
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |
Key Observations:-
# Checking for Null Values in the Attributes
print('\n\033[1mNull Values in the Features:-')
display(vehidata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| compactness | 0 |
| circularity | 5 |
| distance_circularity | 4 |
| radius_ratio | 6 |
| pr.axis_aspect_ratio | 2 |
| max.length_aspect_ratio | 0 |
| scatter_ratio | 1 |
| elongatedness | 1 |
| pr.axis_rectangularity | 3 |
| max.length_rectangularity | 0 |
| scaled_variance | 3 |
| scaled_variance.1 | 2 |
| scaled_radius_of_gyration | 2 |
| scaled_radius_of_gyration.1 | 4 |
| skewness_about | 6 |
| skewness_about.1 | 1 |
| skewness_about.2 | 1 |
| hollows_ratio | 0 |
| class | 0 |
Key Observations:-
# Dropping Null Values
vehidata.dropna(inplace=True)
# Checking for Null Values After Dropping
print('\n\033[1mNull Values in the Features:-')
display(vehidata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| compactness | 0 |
| circularity | 0 |
| distance_circularity | 0 |
| radius_ratio | 0 |
| pr.axis_aspect_ratio | 0 |
| max.length_aspect_ratio | 0 |
| scatter_ratio | 0 |
| elongatedness | 0 |
| pr.axis_rectangularity | 0 |
| max.length_rectangularity | 0 |
| scaled_variance | 0 |
| scaled_variance.1 | 0 |
| scaled_radius_of_gyration | 0 |
| scaled_radius_of_gyration.1 | 0 |
| skewness_about | 0 |
| skewness_about.1 | 0 |
| skewness_about.2 | 0 |
| hollows_ratio | 0 |
| class | 0 |
Key Observations:-
# Checking the Datatypes
print('\n\033[1mData Types of Each Attribute:-')
display(vehidata.dtypes.to_frame('Data Type'))
Data Types of Each Attribute:-
| Data Type | |
|---|---|
| compactness | int64 |
| circularity | float64 |
| distance_circularity | float64 |
| radius_ratio | float64 |
| pr.axis_aspect_ratio | float64 |
| max.length_aspect_ratio | int64 |
| scatter_ratio | float64 |
| elongatedness | float64 |
| pr.axis_rectangularity | float64 |
| max.length_rectangularity | int64 |
| scaled_variance | float64 |
| scaled_variance.1 | float64 |
| scaled_radius_of_gyration | float64 |
| scaled_radius_of_gyration.1 | float64 |
| skewness_about | float64 |
| skewness_about.1 | float64 |
| skewness_about.2 | float64 |
| hollows_ratio | int64 |
| class | object |
Key Observations:-
NOTE:- Here we replace outliers with the mean of the attribute computed without outliers. That is, we first calculate the mean of the non-outlier values and then replace each outlier with this calculated mean.
# Getting Outliers and Imputing Outliers by Mean
AT = []
OL1 = []
OL2 = []
M1 = []
for i in vehidata.columns:
    if i!='class':
        AT.append(i)
        # Getting Interquartile Range
        q1 = vehidata[i].quantile(0.25)
        q3 = vehidata[i].quantile(0.75)
        IQR = q3 - q1
        # Getting Mean of Attribute including Outliers
        M1.append(round(vehidata[i].mean(),2))
        # Separating Outlier and Normal Values
        OL = []
        NOL = []
        for k in vehidata[i]:
            if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
                OL.append(k)
            else:
                NOL.append(k)
        OL1.append(len(OL))
        # Replacing Outliers by Mean of Normal Values (i.e. the mean computed without outliers)
        vehidata[i].replace(OL,np.mean(NOL),inplace=True)
        # Counting Outliers remaining After Imputation
        OL_cnt = 0
        for k in vehidata[i]:
            if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
                OL_cnt += 1
        OL2.append(OL_cnt)
# Creating dataframe for better representation of Outlier Analysis
Outlier_Analysis = pd.DataFrame({'Attribute':AT,
                                 'Mean Including Outliers':M1,
                                 'Outliers Before Imputation':OL1,
                                 'Outliers After Imputation':OL2})
print('\033[1mTotal Outliers Observed in Dataset =',sum(OL1))
print('\n\033[1mTable Showing Outlier Analysis:-')
display(Outlier_Analysis)
Total Outliers Observed in Dataset = 57

Table Showing Outlier Analysis:-
| Attribute | Mean Including Outliers | Outliers Before Imputation | Outliers After Imputation | |
|---|---|---|---|---|
| 0 | compactness | 93.66 | 0 | 0 |
| 1 | circularity | 44.80 | 0 | 0 |
| 2 | distance_circularity | 82.04 | 0 | 0 |
| 3 | radius_ratio | 169.10 | 3 | 0 |
| 4 | pr.axis_aspect_ratio | 61.77 | 8 | 0 |
| 5 | max.length_aspect_ratio | 8.60 | 13 | 0 |
| 6 | scatter_ratio | 168.56 | 0 | 0 |
| 7 | elongatedness | 40.99 | 0 | 0 |
| 8 | pr.axis_rectangularity | 20.56 | 0 | 0 |
| 9 | max.length_rectangularity | 147.89 | 0 | 0 |
| 10 | scaled_variance | 188.38 | 1 | 0 |
| 11 | scaled_variance.1 | 438.38 | 2 | 0 |
| 12 | scaled_radius_of_gyration | 174.25 | 0 | 0 |
| 13 | scaled_radius_of_gyration.1 | 72.40 | 15 | 0 |
| 14 | skewness_about | 6.35 | 12 | 0 |
| 15 | skewness_about.1 | 12.69 | 3 | 0 |
| 16 | skewness_about.2 | 188.98 | 0 | 0 |
| 17 | hollows_ratio | 195.73 | 0 | 0 |
Key Observations:-
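The IQR-based imputation above can also be sketched as a compact pandas helper using boolean masking; the series used here is illustrative, not the project data:

```python
import pandas as pd

def impute_outliers_iqr(s: pd.Series) -> pd.Series:
    """Replace IQR outliers with the mean of the non-outlier values."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # Mark values outside the 1.5*IQR whiskers as outliers
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    # Replace masked values with the mean computed without the outliers
    return s.mask(mask, s[~mask].mean())

# Illustrative series with one obvious outlier (100)
s = pd.Series([10, 11, 12, 13, 14, 100])
cleaned = impute_outliers_iqr(s)  # the 100 becomes 12.0, the mean of the rest
```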
# Describing the data in terms of count, mean, standard deviation, and 5 point summary
print('\n\033[1mBrief Summary of Dataset:-')
display(vehidata.describe()[1:].T)
Brief Summary of Dataset:-
| mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|
| compactness | 93.656827 | 8.233751 | 73.0 | 87.0 | 93.0 | 100.0 | 119.0 |
| circularity | 44.803198 | 6.146659 | 33.0 | 40.0 | 44.0 | 49.0 | 59.0 |
| distance_circularity | 82.043050 | 15.783070 | 40.0 | 70.0 | 79.0 | 98.0 | 112.0 |
| radius_ratio | 168.538272 | 32.322218 | 104.0 | 141.0 | 167.0 | 195.0 | 252.0 |
| pr.axis_aspect_ratio | 61.233540 | 5.639203 | 47.0 | 57.0 | 61.0 | 65.0 | 76.0 |
| max.length_aspect_ratio | 8.133750 | 2.073374 | 3.0 | 7.0 | 8.0 | 10.0 | 13.0 |
| scatter_ratio | 168.563346 | 33.082186 | 112.0 | 146.0 | 157.0 | 198.0 | 265.0 |
| elongatedness | 40.988930 | 7.803380 | 26.0 | 33.0 | 43.0 | 46.0 | 61.0 |
| pr.axis_rectangularity | 20.558426 | 2.573184 | 17.0 | 19.0 | 20.0 | 23.0 | 29.0 |
| max.length_rectangularity | 147.891759 | 14.504648 | 118.0 | 137.0 | 146.0 | 159.0 | 188.0 |
| scaled_variance | 188.215517 | 30.821257 | 130.0 | 167.0 | 179.0 | 216.0 | 288.0 |
| scaled_variance.1 | 436.977805 | 172.969108 | 184.0 | 318.0 | 364.0 | 586.0 | 987.0 |
| scaled_radius_of_gyration | 174.252153 | 32.332161 | 109.0 | 149.0 | 173.0 | 198.0 | 268.0 |
| scaled_radius_of_gyration.1 | 71.887218 | 6.103767 | 59.0 | 67.0 | 71.0 | 75.0 | 87.0 |
| skewness_about | 6.131086 | 4.577902 | 0.0 | 2.0 | 6.0 | 9.0 | 19.0 |
| skewness_about.1 | 12.586420 | 8.770504 | 0.0 | 6.0 | 11.0 | 19.0 | 38.0 |
| skewness_about.2 | 188.979090 | 6.153681 | 176.0 | 184.0 | 189.0 | 193.0 | 206.0 |
| hollows_ratio | 195.729397 | 7.398781 | 181.0 | 191.0 | 197.0 | 201.0 | 211.0 |
# Checking skewness of the data attributes
print('\033[1m\nSkewness of all attributes:-')
display(vehidata.skew().to_frame(name='Skewness'))
Skewness of all attributes:-
| Skewness | |
|---|---|
| compactness | 0.386048 |
| circularity | 0.272723 |
| distance_circularity | 0.114244 |
| radius_ratio | 0.112252 |
| pr.axis_aspect_ratio | 0.151030 |
| max.length_aspect_ratio | 0.075688 |
| scatter_ratio | 0.596913 |
| elongatedness | 0.053941 |
| pr.axis_rectangularity | 0.759483 |
| max.length_rectangularity | 0.271183 |
| scaled_variance | 0.570150 |
| scaled_variance.1 | 0.792170 |
| scaled_radius_of_gyration | 0.266943 |
| scaled_radius_of_gyration.1 | 0.531221 |
| skewness_about | 0.620049 |
| skewness_about.1 | 0.630575 |
| skewness_about.2 | 0.255880 |
| hollows_ratio | -0.229941 |
# Checking Variance of all attributes
print('\033[1m\nVariance of all attributes:-')
display(vehidata.var().to_frame(name='Variance'))
Variance of all attributes:-
| Variance | |
|---|---|
| compactness | 67.794649 |
| circularity | 37.781418 |
| distance_circularity | 249.105287 |
| radius_ratio | 1044.725756 |
| pr.axis_aspect_ratio | 31.800609 |
| max.length_aspect_ratio | 4.298878 |
| scatter_ratio | 1094.431019 |
| elongatedness | 60.892734 |
| pr.axis_rectangularity | 6.621274 |
| max.length_rectangularity | 210.384821 |
| scaled_variance | 949.949858 |
| scaled_variance.1 | 29918.312316 |
| scaled_radius_of_gyration | 1045.368607 |
| scaled_radius_of_gyration.1 | 37.255972 |
| skewness_about | 20.957187 |
| skewness_about.1 | 76.921737 |
| skewness_about.2 | 37.867789 |
| hollows_ratio | 54.741955 |
# Checking Covariance between all attributes
print('\033[1mCovariance between all attributes:-')
display(vehidata.cov())
Covariance between all attributes:-
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| compactness | 67.794649 | 34.915138 | 102.657649 | 191.890805 | 9.013851 | 8.763890 | 221.732972 | -50.633114 | 17.250986 | 80.601971 | 196.094987 | 1160.013639 | 154.778757 | -13.004222 | 7.213691 | 11.812800 | 15.007594 | 22.711209 |
| circularity | 34.915138 | 37.781418 | 77.387793 | 127.520802 | 7.132800 | 7.219940 | 174.500167 | -39.576073 | 13.541069 | 86.067342 | 154.066396 | 898.171024 | 185.935153 | 2.461128 | 3.754326 | 0.116653 | -4.308307 | 2.243477 |
| distance_circularity | 102.657649 | 77.387793 | 249.105287 | 407.822003 | 22.526953 | 22.154515 | 474.635816 | -112.410853 | 36.440216 | 177.066242 | 425.343723 | 2433.981199 | 360.113516 | -23.702933 | 6.842198 | 38.045354 | 14.108044 | 40.080629 |
| radius_ratio | 191.890805 | 127.520802 | 407.822003 | 1044.725756 | 119.531422 | 31.351552 | 834.169972 | -210.036192 | 62.636143 | 272.485857 | 796.741486 | 4339.966946 | 584.670842 | -77.554215 | 4.623453 | 52.023414 | 79.507103 | 117.682594 |
| pr.axis_aspect_ratio | 9.013851 | 7.132800 | 22.526953 | 119.531422 | 31.800609 | 1.701886 | 39.184783 | -13.735376 | 2.544751 | 12.800523 | 39.472649 | 209.110578 | 30.429108 | -10.601513 | -1.464573 | -1.466047 | 13.594577 | 16.968727 |
| max.length_aspect_ratio | 8.763890 | 7.219940 | 22.154515 | 31.351552 | 1.701886 | 4.298878 | 34.909070 | -8.375005 | 2.713388 | 19.697326 | 27.059780 | 174.397310 | 27.811124 | -4.254559 | 0.815054 | 2.480241 | 1.015414 | 6.331938 |
| scatter_ratio | 221.732972 | 174.500167 | 474.635816 | 834.169972 | 39.184783 | 34.909070 | 1094.431019 | -251.289323 | 84.444878 | 387.788885 | 981.754300 | 5644.556586 | 851.147185 | -2.068993 | 8.902511 | 65.451355 | 2.029036 | 33.881695 |
| elongatedness | -50.633114 | -39.576073 | -112.410853 | -210.036192 | -13.735376 | -8.375005 | -251.289323 | 60.892734 | -19.082481 | -87.277062 | -228.409642 | -1287.613958 | -192.142525 | 4.702480 | -1.430347 | -13.592790 | -5.645552 | -13.475906 |
| pr.axis_rectangularity | 17.250986 | 13.541069 | 36.440216 | 62.636143 | 2.544751 | 2.713388 | 84.444878 | -19.082481 | 6.621274 | 30.305593 | 75.288103 | 436.266871 | 65.966161 | 0.097976 | 0.787988 | 5.166606 | -0.275255 | 2.231347 |
| max.length_rectangularity | 80.601971 | 86.067342 | 177.066242 | 272.485857 | 12.800523 | 19.697326 | 387.788885 | -87.277062 | 30.305593 | 210.384821 | 335.515362 | 1989.555369 | 405.768706 | 3.614699 | 8.257237 | 2.077776 | -9.641429 | 9.314270 |
| scaled_variance | 196.094987 | 154.066396 | 425.343723 | 796.741486 | 39.472649 | 27.059780 | 981.754300 | -228.409642 | 75.288103 | 335.515362 | 949.949858 | 5058.721491 | 779.926873 | 0.692479 | 3.045436 | 57.224851 | 3.578138 | 24.165534 |
| scaled_variance.1 | 1160.013639 | 898.171024 | 2433.981199 | 4339.966946 | 209.110578 | 174.397310 | 5644.556586 | -1287.613958 | 436.266871 | 1989.555369 | 5058.721491 | 29918.312316 | 4373.190776 | -17.089730 | 47.705442 | 339.701301 | 23.929776 | 180.845710 |
| scaled_radius_of_gyration | 154.778757 | 185.935153 | 360.113516 | 584.670842 | 30.429108 | 27.811124 | 851.147185 | -192.142525 | 65.966161 | 405.768706 | 779.926873 | 4373.190776 | 1045.368607 | 39.114689 | 23.355755 | -9.923204 | -44.665903 | -24.440304 |
| scaled_radius_of_gyration.1 | -13.004222 | 2.461128 | -23.702933 | -77.554215 | -10.601513 | -4.254559 | -2.068993 | 4.702480 | 0.097976 | 3.614699 | 0.692479 | -17.089730 | 39.114689 | 37.255972 | -1.604584 | -5.871768 | -31.483962 | -40.786057 |
| skewness_about | 7.213691 | 3.754326 | 6.842198 | 4.623453 | -1.464573 | 0.815054 | 8.902511 | -1.430347 | 0.787988 | 8.257237 | 3.045436 | 47.705442 | 23.355755 | -1.604584 | 20.957187 | -0.903254 | 2.306281 | 2.137364 |
| skewness_about.1 | 11.812800 | 0.116653 | 38.045354 | 52.023414 | -1.466047 | 2.480241 | 65.451355 | -13.592790 | 5.166606 | 2.077776 | 57.224851 | 339.701301 | -9.923204 | -5.871768 | -0.903254 | 76.921737 | 3.770647 | 12.186918 |
| skewness_about.2 | 15.007594 | -4.308307 | 14.108044 | 79.507103 | 13.594577 | 1.015414 | 2.029036 | -5.645552 | -0.275255 | -9.641429 | 3.578138 | 23.929776 | -44.665903 | -31.483962 | 2.306281 | 3.770647 | 37.867789 | 40.706157 |
| hollows_ratio | 22.711209 | 2.243477 | 40.080629 | 117.682594 | 16.968727 | 6.331938 | 33.881695 | -13.475906 | 2.231347 | 9.314270 | 24.165534 | 180.845710 | -24.440304 | -40.786057 | 2.137364 | 12.186918 | 40.706157 | 54.741955 |
Key Observations:-
# Plotting Frequency Distribution of categorical attribute
colors = ['gold','tomato','yellowgreen','#ADD8E6']
print(f'\033[1mPlot Showing Frequency Distribution of Attribute Class:-')
plt.figure(figsize=(10,8))
plt.title(f'Frequencies of Class Attribute\n')
sns.countplot(vehidata['class'],palette='bright');
plt.show()
print('\n___________________________________________________________________________________')
print('')
# Plotting Pie Chart to check contribution of categorical attribute
print(f'\033[1m\nPie Chart Showing Contribution of Each Category of Class feature:-\n')
plt.title(f'Contribution of Each Category of Class Attribute\n\n\n\n\n\n')
vehidata['class'].value_counts().plot.pie(radius=2.5,shadow=True,autopct='%1.1f%%',colors=colors);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
plt.show()
Plot Showing Frequency Distribution of Attribute Class:-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of Class feature:-
Key Observations:-
# Bivariate Analysis for Class VS All Attributes
print(f'\033[1m\nPlots Showing Bivariate Analysis of Class VS All Attributes:-\n')
# Setting up Sub-Plots
fig, axes = plt.subplots(6, 3, figsize=(18, 20))
fig.suptitle(f'Class VS All Attributes')
plt.subplots_adjust(left=0.1,bottom=0.1, right=0.9, top=0.94, wspace=0.3, hspace=0.6)
# Plotting Sub-Plots
sns.violinplot(ax=axes[0, 0], x='class', y='compactness', data=vehidata, palette='bright');
sns.violinplot(ax=axes[0, 1], x='class', y='circularity', data=vehidata, palette='bright');
sns.violinplot(ax=axes[0, 2], x='class', y='distance_circularity', data=vehidata, palette='bright');
sns.violinplot(ax=axes[1, 0], x='class', y='radius_ratio', data=vehidata, palette='bright');
sns.violinplot(ax=axes[1, 1], x='class', y='pr.axis_aspect_ratio', data=vehidata, palette='bright');
sns.violinplot(ax=axes[1, 2], x='class', y='max.length_aspect_ratio', data=vehidata, palette='bright');
sns.violinplot(ax=axes[2, 0], x='class', y='scatter_ratio', data=vehidata, palette='bright');
sns.violinplot(ax=axes[2, 1], x='class', y='elongatedness', data=vehidata, palette='bright');
sns.violinplot(ax=axes[2, 2], x='class', y='pr.axis_rectangularity', data=vehidata, palette='bright');
sns.violinplot(ax=axes[3, 0], x='class', y='max.length_rectangularity', data=vehidata, palette='bright');
sns.violinplot(ax=axes[3, 1], x='class', y='scaled_variance', data=vehidata, palette='bright');
sns.violinplot(ax=axes[3, 2], x='class', y='scaled_variance.1', data=vehidata, palette='bright');
sns.violinplot(ax=axes[4, 0], x='class', y='scaled_radius_of_gyration', data=vehidata, palette='bright');
sns.violinplot(ax=axes[4, 1], x='class', y='scaled_radius_of_gyration.1', data=vehidata, palette='bright');
sns.violinplot(ax=axes[4, 2], x='class', y='skewness_about', data=vehidata, palette='bright');
sns.violinplot(ax=axes[5, 0], x='class', y='skewness_about.1', data=vehidata, palette='bright');
sns.violinplot(ax=axes[5, 1], x='class', y='skewness_about.2', data=vehidata, palette='bright');
sns.violinplot(ax=axes[5, 2], x='class', y='hollows_ratio', data=vehidata, palette='bright');
plt.show()
Plots Showing Bivariate Analysis of Class VS All Attributes:-
# Multivariate Analysis of Attributes
print('\033[1mPlot Showing Multivariate Analysis to check Relation between Attributes:-')
# Plotting pairplot for Attributes
sns.pairplot(vehidata,plot_kws={'color':'#9400D3'},diag_kws={'color':'Gold'})
plt.show()
Plot Showing Multivariate Analysis to check Relation between Attributes:-
# Plotting Heatmap for checking Correlation
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(22,18))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(vehidata.corr(),annot=True,fmt= '.2f',cmap='Spectral');
plt.show()
Heatmap showing Correlation of Data attributes:-
Key Observations:-
# Seperating Independent and Dependent Attributes
# Getting Predictors by dropping Class Attribute
X = vehidata.drop(columns='class')
# Getting Target Attribute
y = vehidata['class']
# Applying Z-Scores to Predictors
X_Scale = X.apply(zscore)
Key Observations:-
# Checking Value Counts of Target Attribute
print('\033[1mTable Showing Total Observations of class:-')
TAC = y.value_counts().to_frame('Total Observations')
display(TAC)
# Getting Percentages of each category in Target Attribute
labels = ['Car','Bus','Van']
print('\033[1m\n\nPie Chart Showing Percentage of Each Category of Target Attribute:-')
plt.title('Percentage of Each Category of Target Attribute\n\n\n\n\n\n')
explode = (0.05, 0.1, 0.1)
y.value_counts().plot.pie(radius=2,explode=explode,shadow=True,autopct='%1.1f%%',startangle=-90,labels=labels,colors=colors);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2.6, 1))
plt.show()
Table Showing Total Observations of class:-
| Total Observations | |
|---|---|
| car | 413 |
| bus | 205 |
| van | 195 |
Pie Chart Showing Percentage of Each Category of Target Attribute:-
Key Observations:-
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
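The convex-combination step described above can be sketched directly in NumPy; `a` and `b` here are two hypothetical minority-class points, not values from the dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
a = np.array([1.0, 2.0])   # a minority-class instance
b = np.array([3.0, 6.0])   # one of its k nearest minority-class neighbors

# SMOTE draws a point uniformly at random on the line segment joining a and b
lam = rng.random()                # lam in [0, 1)
synthetic = a + lam * (b - a)     # convex combination of a and b
```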
# Getting total observations of target attribute before transformation
yct = y.count()
# Transforming the dataset
OS = SMOTE(random_state=1)
X, y = OS.fit_resample(X_Scale, y)
# Checking Value Counts of Target Attribute after transforming
print('\033[1mTable Showing Total Observations in each section of target data for SMOTE:-')
TAC2 = y.value_counts().to_frame('Total Observations')
# For better representation of Transformation
TVC = pd.DataFrame({'Before Transformation':TAC['Total Observations'],'After Transformation':TAC2['Total Observations']})
total = pd.Series({'Before Transformation':yct,'After Transformation':y.count()},name='Total')
TVC = TVC.append(total)
columns=[('__________Total Observations__________', 'Before Transformation'),
         ('__________Total Observations__________', 'After Transformation')]
TVC.columns = pd.MultiIndex.from_tuples(columns)
display(TVC)
# Getting Percentages of each category in Target Attribute
labels = ['Car','Bus','Van']
print('\033[1m\n\nPie Chart Showing Percentage of Each Category of Target Attribute:-')
plt.title('Percentage of Each Category of Target Attribute\n\n\n\n\n\n')
explode = (0.05, 0.1, 0.1)
y.value_counts().plot.pie(radius=2,explode=explode,shadow=True,autopct='%1.1f%%',startangle=-90,labels=labels,colors=colors);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2.6, 1))
plt.show()
Table Showing Total Observations in each section of target data for SMOTE:-
| __________Total Observations__________ | ||
|---|---|---|
| Before Transformation | After Transformation | |
| bus | 205 | 413 |
| car | 413 | 413 |
| van | 195 | 413 |
| Total | 813 | 1239 |
Pie Chart Showing Percentage of Each Category of Target Attribute:-
Key Observations:-
# Splitting into Train and Test Sets
# Here test_size is not given because its default value is 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
# For better observation of Splitted Data
TTS = pd.DataFrame({'Train':y_train.value_counts(),'Test':y_test.value_counts(),'Total Observations':y.value_counts()})
total = pd.Series({'Train':y_train.count(),'Test':y_test.count(),'Total Observations':y.shape[0]},name='Total')
TTS = TTS.append(total)
print('\033[1mTable Showing Train-Test Split of Data:-')
display(TTS)
Table Showing Train-Test Split of Data:-
| Train | Test | Total Observations | |
|---|---|---|---|
| bus | 310 | 103 | 413 |
| car | 309 | 104 | 413 |
| van | 310 | 103 | 413 |
| Total | 929 | 310 | 1239 |
Key Observations:-
# Fitting SVM Classifier
model = SVC(gamma=0.025, C=3)
model.fit(X_train, y_train)
# Getting Accuracy of Test Data
print('\033[1mAccuracy(%)\n Train Data =',round(model.score(X_train, y_train)*100,2),'%')
print('\033[1m Test Data =',round(model.score(X_test, y_test)*100,2),'%')
Accuracy(%)
 Train Data = 98.17 %
 Test Data = 96.13 %
Key Observations:-
# Fitting PCA
pca = PCA(n_components=18)
display(pca.fit(X))
PCA(n_components=18)
# Plotting Eigen Value to get dimension
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation Explained')
plt.xlabel('Principal Component')
plt.tight_layout()
plt.show()
Key Observations:-
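Instead of reading the cut-off from the plot, `PCA` also accepts a fractional `n_components`, keeping the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on illustrative correlated data (not the vehicle dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 18-feature data: 6 base columns, 6 near-duplicates, 6 noise columns
base = rng.normal(size=(200, 6))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 6)), rng.normal(size=(200, 6))])

# A float n_components keeps just enough components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
```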
# Fitting PCA
Pca = PCA(n_components=8)
display(Pca.fit(X))
# Transforming Predictors
X_pca = Pca.transform(X)
PCA(n_components=8)
Key Observations:-
# Here test_size is not given because by default its value is 0.25.
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=1, stratify=y)
# Fitting SVM Classifier
model = SVC(gamma=0.025, C=3)
model.fit(X_train, y_train)
# Getting Accuracy of Test Data
print('\033[1mAccuracy(%)\n Train Data =',round(model.score(X_train, y_train)*100,2),'%')
print('\033[1m Test Data =',round(model.score(X_test, y_test)*100,2),'%')
# Building Confusion Matrix for SVM Model
test_pred = model.predict(X_test)
CM = metrics.confusion_matrix(y_test, test_pred)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for SVM Model
print('\033[1m\nHeatmap Showing Performance of SVM Model:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of SVM Model\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Accuracy(%)
 Train Data = 97.09 %
 Test Data = 94.52 %

Heatmap Showing Performance of SVM Model:-
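Beyond the confusion matrix, scikit-learn's `classification_report` summarizes per-class precision and recall; a minimal sketch with hypothetical labels, not the model's actual predictions:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted class labels
y_true = ['bus', 'car', 'van', 'car', 'bus', 'van']
y_pred = ['bus', 'car', 'van', 'bus', 'bus', 'van']

# output_dict=True returns the metrics as a nested dict instead of a string
report = classification_report(y_true, y_pred, output_dict=True)
```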
Closing Sentence:- The dimensionality reduction technique (PCA) is implemented and the model is trained using principal components instead of just the raw data.
DOMAIN: Sports management
CONTEXT: Company X is a sports management company for international cricket.
DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:
PROJECT OBJECTIVE: Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.
Steps and tasks:
# Loading Data
spdata = pd.read_csv('Batting_bowling_ipl_bat.csv')
# Getting Shape and Size
shape = spdata.shape
# Displaying Dataset
print('\033[1mDataset consist:-\033[0m\n Number of Rows =',shape[0],'\n Number of Columns =',shape[1])
print('\033[1m\nDataset:-')
display(spdata.head(10))
Dataset consist:-
 Number of Rows = 180
 Number of Columns = 7

Dataset:-
| Name | Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | CH Gayle | 733.0 | 61.08 | 160.74 | 46.0 | 59.0 | 9.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | G Gambhir | 590.0 | 36.87 | 143.55 | 64.0 | 17.0 | 6.0 |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 |
| 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | CL White | 479.0 | 43.54 | 149.68 | 41.0 | 20.0 | 5.0 |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.0 | 18.0 | 5.0 |
Key Observations:-
# Checking for Null Values in the Attributes
print('\n\033[1mNull Values in the Features:-')
display(spdata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| Name | 90 |
| Runs | 90 |
| Ave | 90 |
| SR | 90 |
| Fours | 90 |
| Sixes | 90 |
| HF | 90 |
Key Observations:-
# Dropping Null Values
spdata.dropna(inplace=True)
# Checking for Null Values After Dropping
print('\n\033[1mNull Values in the Features:-')
display(spdata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| Name | 0 |
| Runs | 0 |
| Ave | 0 |
| SR | 0 |
| Fours | 0 |
| Sixes | 0 |
| HF | 0 |
Key Observations:-
# Describing the data in terms of count, mean, standard deviation, and 5 point summary
print('\n\033[1mBrief Summary of Dataset:-')
display(spdata.describe())
Brief Summary of Dataset:-
| Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|
| count | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 |
| mean | 219.933333 | 24.729889 | 119.164111 | 19.788889 | 7.577778 | 1.188889 |
| std | 156.253669 | 13.619215 | 23.656547 | 16.399845 | 8.001373 | 1.688656 |
| min | 2.000000 | 0.500000 | 18.180000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 98.000000 | 14.665000 | 108.745000 | 6.250000 | 3.000000 | 0.000000 |
| 50% | 196.500000 | 24.440000 | 120.135000 | 16.000000 | 6.000000 | 0.500000 |
| 75% | 330.750000 | 32.195000 | 131.997500 | 28.000000 | 10.000000 | 2.000000 |
| max | 733.000000 | 81.330000 | 164.100000 | 73.000000 | 59.000000 | 9.000000 |
# Checking skewness of the data attributes
print('\033[1m\nSkewness of all attributes:-')
display(spdata.skew().to_frame(name='Skewness'))
Skewness of all attributes:-
| Skewness | |
|---|---|
| Runs | 0.754618 |
| Ave | 1.038076 |
| SR | -1.166175 |
| Fours | 1.107192 |
| Sixes | 3.226595 |
| HF | 2.001199 |
# Checking Variance of all attributes
print('\033[1m\nVariance of all attributes:-')
display(spdata.var().to_frame(name='Variance'))
Variance of all attributes:-
| Variance | |
|---|---|
| Runs | 24415.208989 |
| Ave | 185.483008 |
| SR | 559.632193 |
| Fours | 268.954931 |
| Sixes | 64.021973 |
| HF | 2.851561 |
# Checking Covariance between all attributes
print('\033[1mCovariance between all attributes:-')
display(spdata.cov())
Covariance between all attributes:-
| Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|
| Runs | 24415.208989 | 1474.707184 | 1824.142412 | 2354.480150 | 962.409738 | 220.361049 |
| Ave | 1474.707184 | 185.483008 | 200.915584 | 121.997954 | 74.364335 | 14.276201 |
| SR | 1824.142412 | 200.915584 | 559.632193 | 149.292451 | 110.531531 | 17.081012 |
| Fours | 2354.480150 | 121.997954 | 149.292451 | 268.954931 | 68.572784 | 21.703246 |
| Sixes | 962.409738 | 74.364335 | 110.531531 | 68.572784 | 64.021973 | 10.372784 |
| HF | 220.361049 | 14.276201 | 17.081012 | 21.703246 | 10.372784 | 2.851561 |
# Checking Correlation by plotting Heatmap for all attributes
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(8,6))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(spdata.corr(),annot=True,fmt= '.2f',cmap='magma');
plt.show()
Heatmap showing Correlation of Data attributes:-
Key Observations:-
Univariate analysis is the simplest form of analyzing data. It involves only one variable.
We will use this function for easy analysis of individual attributes.
def qt_data(x):
# Distribution plot
print(f'\033[1mPlot Showing Distribution of Feature "{x}":-')
plt.figure(figsize=(12,6))
plt.title(f'Distribution of "{x}"\n')
sns.histplot(spdata[x],kde=True,color='#9400D3');  # histplot+kde replaces the deprecated distplot
print('')
plt.show()
print('\n__________________________________________________________________________________________________\n')
print('')
# Box plot for Quantitative data
print(f'\033[1mPlot Showing 5 point summary with outliers of Attribute "{x}":-\n')
plt.figure(figsize=(12,6))
plt.title(f'Box Plot for "{x}"\n')
sns.boxplot(x=spdata[x],color="#9400D3");
plt.show()
# Univariate analysis for Runs Attribute
qt_data('Runs')
Plot Showing Distribution of Feature "Runs":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "Runs":-
# Univariate analysis for Ave Attribute
qt_data('Ave')
Plot Showing Distribution of Feature "Ave":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "Ave":-
# Univariate analysis for SR Attribute
qt_data('SR')
Plot Showing Distribution of Feature "SR":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "SR":-
# Univariate analysis for Fours Attribute
qt_data('Fours')
Plot Showing Distribution of Feature "Fours":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "Fours":-
# Univariate analysis for Sixes Attribute
qt_data('Sixes')
Plot Showing Distribution of Feature "Sixes":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "Sixes":-
# Plotting Frequency Distribution of categorical attribute
colors = ['gold','tomato','yellowgreen','pink','red','#ADD8E6','green']
print(f'\033[1mPlot Showing Frequency Distribution of Attribute HF:-')
plt.figure(figsize=(10,8))
plt.title(f'Frequencies of Class Attribute\n')
sns.countplot(x='HF',data=spdata,palette='bright');
plt.show()
print('\n___________________________________________________________________________________')
print('')
# Plotting Pie Chart to check contribution of categorical attribute
print(f'\033[1m\nPie Chart Showing Contribution of Each Category of HF feature:-\n')
plt.title(f'Contribution of Each Category of HF Attribute\n\n\n\n\n\n')
spdata['HF'].value_counts().plot.pie(radius=2.5,shadow=True,autopct='%1.1f%%',colors=colors);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
plt.show()
Plot Showing Frequency Distribution of Attribute HF:-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of HF feature:-
# Bivariate Analysis for HF VS All Attributes
print(f'\033[1m\nPlots Showing Bivariate Analysis of HF VS All Attributes:-\n')
# Setting up Sub-Plots
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle(f'HF VS All Attributes')
plt.subplots_adjust(left=0.1,bottom=0.1, right=0.9, top=0.94, wspace=0.3, hspace=0.4)
# Plotting Sub-Plots
sns.violinplot(ax=axes[0, 0], x='HF', y='Runs', data=spdata, palette='bright');
sns.violinplot(ax=axes[0, 1], x='HF', y='Ave', data=spdata, palette='bright');
sns.violinplot(ax=axes[1, 0], x='HF', y='SR', data=spdata, palette='bright');
sns.violinplot(ax=axes[1, 1], x='HF', y='Fours', data=spdata, palette='bright');
sns.violinplot(ax=axes[2, 0], x='HF', y='Sixes', data=spdata, palette='bright');
plt.show()
Plots Showing Bivariate Analysis of HF VS All Attributes:-
# Multivariate Analysis of Attributes
print('\033[1mPlot Showing Multivariate Analysis to check Relation between Attributes:-')
# Plotting pairplot for Attributes
sns.pairplot(spdata,plot_kws={'color':'#9400D3'},diag_kws={'color':'Gold'}).fig.suptitle('Relation between Attributes',
y=1.04);
plt.show()
Plot Showing Multivariate Analysis to check Relation between Attributes:-
Multivariate Analysis : To check Density of Categorical Attribute in all other Attributes
# Multivariate Analysis to check Density of Categorical Attribute
print('\033[1mPlot Showing Multivariate Analysis to check Density of Categorical Attribute:-')
sns.pairplot(spdata,hue='HF',palette='bright').fig.suptitle('Density of Categorical Attribute',y=1.04);
plt.show()
Plot Showing Multivariate Analysis to check Density of Categorical Attribute:-
Multivariate Analysis : To Check Correlation
# Plotting Heatmap for checking Correlation
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(12,10))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(spdata.corr(),annot=True,fmt= '.2f',cmap='flare');
plt.show()
Heatmap showing Correlation of Data attributes:-
NOTE:- Here we are replacing outliers with the mean of the attribute computed without outliers. That is, for each attribute we first calculate the mean excluding outliers, and then replace the outliers with this calculated mean.
# Getting Outliers and Imputing Outliers by Mean
AT = []
OL1 = []
OL2 = []
M1 = []
M2 = []
for i in spdata.columns:
if i!='Name':
AT.append(i)
# Getting Interquartile Range
q1 = spdata[i].quantile(0.25)
q3 = spdata[i].quantile(0.75)
IQR = q3 - q1
# Getting Mean of Attribute having Outliers (i.e including outliers)
M1.append(round(spdata[i].mean(),2))
# Getting Outlier and Normal Values Seperated
OL = []
NOL = []
for k in spdata[i]:
if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
OL.append(k)
else:
NOL.append(k)
OL1.append(len(OL))
# Replacing Outliers by Mean of Normal Values
spdata[i].replace(OL,np.mean(NOL),inplace=True) # Here we are imputing outliers by Mean of attribute without outlier
M2.append(round(np.mean(NOL),2))
# Getting Outliers After Imputation
OL_cnt = 0
for k in spdata[i]:
if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
OL_cnt += 1
OL2.append(OL_cnt)
# Creating dataframe for better representation of Outlier Analysis
Outlier_Analysis = pd.DataFrame({'Attribute':AT,
'Mean Including Outliers':M1,
'Outliers Before Imputation':OL1,
'Mean Excluding Outliers':M2,
'Outliers After Imputation':OL2})
print('\033[1mTotal Outliers Observed in Dataset =',sum(OL1))
print('\n\033[1mTable Showing Outlier Analysis:-')
display(Outlier_Analysis)
Total Outliers Observed in Dataset = 15

Table Showing Outlier Analysis:-
| | Attribute | Mean Including Outliers | Outliers Before Imputation | Mean Excluding Outliers | Outliers After Imputation |
|---|---|---|---|---|---|
| 0 | Runs | 219.93 | 1 | 214.17 | 0 |
| 1 | Ave | 24.73 | 3 | 23.24 | 0 |
| 2 | SR | 119.16 | 5 | 123.02 | 0 |
| 3 | Fours | 19.79 | 3 | 18.17 | 0 |
| 4 | Sixes | 7.58 | 1 | 7.00 | 0 |
| 5 | HF | 1.19 | 2 | 1.05 | 0 |
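The imputation loop above can also be written vectorised with pandas masks; a sketch of the same mean-excluding-outliers idea on a toy series (values are illustrative):

```python
import pandas as pd

s = pd.Series([10., 12., 11., 13., 12., 11., 100.])   # 100 is an obvious outlier

# IQR fences, as in the loop above
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Mean computed only over the non-outlier values, then imputed in place of outliers
clean_mean = s[~is_outlier].mean()
s_imputed = s.mask(is_outlier, clean_mean)

print(s_imputed.tolist())
```

The vectorised form avoids the inner Python loops and scales better on larger frames.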
Key Observations:-
# Dropping Name Attribute
X = spdata.drop(columns='Name')
# Applying Z-Scores to Predictors
X = X.apply(zscore)
# Fitting PCA
pca1 = PCA(n_components=6)
display(pca1.fit(X))
PCA(n_components=6)
# Plotting explained variance ratios to choose the number of dimensions
plt.step(list(range(1,7)),np.cumsum(pca1.explained_variance_ratio_), where='mid')
plt.bar(list(range(1,7)),pca1.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation Explained')
plt.xlabel('Principal Component')
plt.tight_layout()
plt.show()
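Instead of reading the cut-off from the plot, sklearn's PCA can select the dimension directly: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic correlated data (the latent-factor construction is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 200 samples of 6 correlated features built from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 6)) + 0.05 * rng.normal(size=(200, 6))

# Manual choice: first k where cumulative explained variance reaches 95%
pca_full = PCA().fit(X_demo)
cum = np.cumsum(pca_full.explained_variance_ratio_)
k_manual = int(np.argmax(cum >= 0.95)) + 1

# Automatic choice: float n_components asks for that variance fraction directly
pca_auto = PCA(n_components=0.95).fit(X_demo)
assert pca_auto.n_components_ == k_manual
print('components needed for 95% variance:', k_manual)
```

Since the data is built from 3 latent factors, at most 3 components are needed to pass the 95% mark.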
Key Observations:-
# Fitting PCA
PCa = PCA(n_components=4)
display(PCa.fit(X))
# Transforming Predictors
X_PCA = PCa.transform(X)
PCA(n_components=4)
Key Observations:-
# Converting X_PCA to dataframe
TD = pd.DataFrame(X_PCA)
# Descending Sorting the data
TD.sort_values(by=0, ascending=False, inplace=True)
# Getting index values
index = TD.index
Key Observations:-
# Re-setting index values
spdata = spdata.reset_index()
spdata = spdata.drop(columns='index')
# Re-indexing
spdata = spdata.reindex(index)
Key Observations:-
# Finalised Sorted Players
print('\033[1m\nFinalised Sorted Players:-')
display(spdata.head(10))
Finalised Sorted Players:-
| | Name | Runs | Ave | SR | Fours | Sixes | HF |
|---|---|---|---|---|---|---|---|
| 2 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.000000 | 19.0 | 5.000000 |
| 4 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.000000 | 18.0 | 5.000000 |
| 3 | CL White | 479.0 | 43.54 | 149.68 | 41.000000 | 20.0 | 5.000000 |
| 7 | RG Sharma | 433.0 | 30.92 | 126.60 | 39.000000 | 18.0 | 5.000000 |
| 5 | AM Rahane | 560.0 | 40.00 | 129.33 | 18.172414 | 10.0 | 5.000000 |
| 8 | AB de Villiers | 319.0 | 39.87 | 161.11 | 26.000000 | 15.0 | 3.000000 |
| 1 | G Gambhir | 590.0 | 36.87 | 143.55 | 18.172414 | 17.0 | 1.045455 |
| 12 | F du Plessis | 398.0 | 33.16 | 130.92 | 29.000000 | 17.0 | 3.000000 |
| 10 | DA Warner | 256.0 | 36.57 | 164.10 | 28.000000 | 14.0 | 3.000000 |
| 13 | OA Shah | 340.0 | 37.77 | 132.81 | 24.000000 | 16.0 | 3.000000 |
Key Observations:-
Closing Sentence:- A data-driven batsman ranking model has been built based on player performance, helping the sports management company make business decisions.
Questions:
1. Principal Component Analysis: PCA is a technique which helps us in extracting a new set of variables from an existing large set of variables. These newly extracted variables are called Principal Components.
2. Random Forest: Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with in-built feature importance so you don’t need to program that separately. This helps us select a smaller subset of features.
3. Missing Value Ratio: one of the most basic feature selection techniques. For each feature, the ratio of missing values to total observations is computed, and features whose ratio exceeds a chosen threshold are dropped. Like other dimensionality reduction methods, this brings benefits such as reduced computational/training time and easier visualization.
4. Low Variance Filter: Low Variance Filter is a useful dimensionality reduction algorithm. The variance is a statistical measure of the amount of variation in the given variable. If the variance is too low, it means that it does not change much and hence it can be ignored.
5. High Correlation Filter: This dimensionality reduction algorithm tries to discard inputs that are very similar to others. If there is a very high correlation between two input variables, we can safely drop one of them.
6. Forward Feature Selection: Forward feature selection starts with the evaluation of each individual feature, and selects that which results in the best performing selected algorithm model.
7. Backward Feature Elimination: Backward elimination is a feature selection technique while building a machine learning model. It is used to remove those features that do not have a significant effect on the dependent variable or prediction of output.
8. Independent Component Analysis: Independent Component Analysis (ICA) extracts hidden factors within data by transforming a set of variables to a new set that is maximally independent.
9. Factor Analysis: Factor analysis is a technique that is used to reduce a large number of variables into fewer numbers of factors. This technique extracts maximum common variance from all variables and puts them into a common score.
10. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
11. Uniform Manifold Approximation and Projection (UMAP) UMAP is a dimension reduction technique that can preserve as much of the local, and more of the global data structure as compared to t-SNE, with a shorter runtime.
12. Random Projection (RP): In RP, a higher dimensional data is projected onto a lower-dimensional subspace using a random matrix whose columns have unit length.
13. Singular Value Decomposition (SVD): SVD factorizes the data matrix into singular vectors and singular values; keeping only the largest singular values (truncated SVD) yields a lower-dimensional representation. Fewer input variables can result in a simpler predictive model that may perform better on new data.
14. Latent Semantic Analysis (LSA): LSA learns latent topics by performing a matrix decomposition on the document-term matrix using Singular Value Decomposition. LSA is typically used as a dimension reduction or noise reducing technique.
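As an example, the High Correlation Filter (item 5) can be sketched by scanning the upper triangle of the absolute correlation matrix and dropping one column from each highly correlated pair (the 0.9 threshold and the toy columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x1': [1., 2., 3., 4., 5.],
    'x2': [2., 4., 6., 8., 10.],   # perfectly correlated with x1
    'x3': [5., 3., 4., 1., 2.],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

reduced = df.drop(columns=to_drop)
print('dropped:', to_drop)
```

Scanning only the upper triangle ensures that from each correlated pair exactly one column is removed, never both.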
Let's talk about the PCA dimensionality reduction technique.
PCA is a dimensionality reduction technique often used to compress the variables of a large dataset into a smaller set that retains most of the information, so an efficient model can still be built. The idea of PCA is to reduce the number of variables while preserving as much of the data's variance as possible.
An image is a grid of pixels placed row after row, where each pixel value represents an intensity. Treating each row of pixels as a vector, an image (or a collection of images) can be arranged as a matrix. Storing many images this way requires a large amount of space, so here we use PCA to compress the image while preserving as much of the information as possible.
# Loading the Image
image = mplib.imread('coffee.jpg')
# Displaying the Image
print('\033[1mPlot for displaying the Image:-')
plt.figure(figsize=(12,9)).suptitle('Coffee Image',y=0.9);
plt.imshow(image)
plt.show()
Plot for displaying the Image:-
Key Observations:-
# Shape and Dimension of Image
print('\033[1mShape of the Image :-',image.shape)
print('\033[1mDimension of the Image :-',len(image.shape))
Shape of the Image :- (667, 1000, 3)
Dimension of the Image :- 3
Key Observations:-
# Re-shaping the Image
image_rs = np.reshape(image, (1000, 667*3))
# Displaying Shape and Dimension of the Image After Re-shaping
print('\033[1mShape of Re-shaped Image :-',image_rs.shape)
print('\033[1mDimension of Re-shaped Image :-',len(image_rs.shape))
# Displaying Image after Re-shaping
print('\033[1m\nPlot for displaying Re-shaped Image:-')
plt.figure(figsize=(12,9)).suptitle('Re-Shaped Coffee Image',y=0.9);
plt.imshow(image_rs)
plt.show()
Shape of Re-shaped Image :- (1000, 2001)
Dimension of Re-shaped Image :- 2

Plot for displaying Re-shaped Image:-
Key Observations:-
# Fitting PCA and Transforming the Image
pca = PCA(30).fit(image_rs) # Here we reduce the re-shaped image to 30 principal components
Transformed_img = pca.transform(image_rs)
# Recovering Image
Image = pca.inverse_transform(Transformed_img)
# Displaying Shape and Dimension of Recovered Image
print('\033[1mShape of the Transformed Image :-',Transformed_img.shape)
print('\033[1mShape of the Recovered Image :-',Image.shape)
# Displaying Recovered Image
print('\033[1m\nPlot for displaying Compressed Image:-')
plt.figure(figsize=(12,9)).suptitle('PCA Compressed Coffee Image',y=0.9);
plt.imshow(Image)
plt.show()
Shape of the Transformed Image :- (1000, 30)
Shape of the Recovered Image :- (1000, 2001)

Plot for displaying Compressed Image:-
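How much information the 30 components retain can be read from explained_variance_ratio_, and the storage saving follows from the shapes involved; a sketch on a synthetic low-rank matrix standing in for the reshaped image:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic low-rank matrix (rank ~20 plus tiny noise) standing in for the reshaped image
base = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 300))
M = base + 0.01 * rng.normal(size=(1000, 300))

pca = PCA(n_components=30).fit(M)
retained = pca.explained_variance_ratio_.sum()

# Store scores (1000 x 30), components (30 x 300) and the mean vector (300,)
# instead of the full (1000 x 300) matrix
stored = 1000 * 30 + 30 * 300 + 300
print(f'variance retained: {retained:.4f}, values stored: {stored} vs {1000 * 300}')
```

Because the signal is essentially rank 20, 30 components retain nearly all the variance while storing only a fraction of the original values; on a real photograph the retained fraction would be lower but usually still high.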
Key Observations:-
# Comparing Original Image with PCA Compressed Image
print('\033[1mPlot showing Comparison of Original Image VS PCA Compressed Image:-\n')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle('Original Image VS PCA Compressed Image',y=0.75)
fig.set_figwidth(25)
fig.set_figheight(15)
ax1.title.set_text('Original Image')
ax1.imshow(image)
ax2.title.set_text('PCA Compressed Image')
ax2.imshow(Image)
plt.show()
Plot showing Comparison of Original Image VS PCA Compressed Image:-
Key Observations:-
Closing Sentence:- The dimensionality reduction technique has been illustrated on multimedia image data.